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Abstract 



In this paper we develop a new approach to sparse principal component analysis (sparse PCA). 
We propose two single-unit and two block optimization formulations of the sparse PCA problem, 
aimed at extracting a single sparse dominant principal component of a data matrix, or more com- 
ponents at once, respectively. While the initial formulations involve nonconvex functions, and are 
therefore computationally intractable, we rewrite them into the form of an optimization program in- 
volving maximization of a convex function on a compact set. The dimension of the search space is 
decreased enormously if the data matrix has many more columns (variables) than rows. We then pro- 
pose and analyze a simple gradient method suited for the task. It appears that our algorithm has best 
convergence properties in the case when either the objective function or the feasible set are strongly 
convex, which is the case with our single-unit formulations and can be enforced in the block case. 
Finally, we demonstrate numerically on a set of random and gene expression test problems that our 
approach outperforms existing algorithms both in quality of the obtained solution and in computa- 
tional speed. 
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1 Introduction 

Principal component analysis (PCA) is a well established tool for making sense of high dimensional 
data by reducing it to a smaller dimension. It has applications virtually in all areas of science — machine 
learning, image processing, engineering, genetics, neurocomputing, chemistry, meteorology, control the- 
ory, computer networks — to name just a few — where large data sets are encountered. It is important that 
having reduced dimension, the essential characteristics of the data are retained. If yl G j^pxn ^ matrix 
encoding p samples of n variables, with n being large, PCA aims at finding a few linear combinations of 
these variables, called principal components, which point in orthogonal directions explaining as much of 
the variance in the data as possible. If the variables contained in the columns of A are centered, then the 
classical PCA can be written in terms of the scaled sample covariance matrix S = AA^ as follows: 



Extracting one component amounts to computing the dominant eigenvector of S (or, equivalently, 
dominant right singular vector of A). Full PCA involves the computation of the singular value decom- 
position (SVD) of A. Principal components are, in general, combinations of all the input variables, i.e. 
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Find z* = are max z^YiZ. 

zTz<l 
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the loading vector z* is not expected to have many zero coefficients. In most applications, however, 
the original variables have concrete physical meaning and PCA then appears especially interpretable if 
the extracted components are composed only from a small number of the original variables. In the case 
of gene expression data, for instance, each variable represents the expression level of a particular gene. 
A good analysis tool for biological interpretation should be capable to highlight "simple" structures in 
the genome — structures expected to involve a few genes only — that explain a significant amount of the 
specific biological processes encoded in the data. Components that are linear combinations of a small 
number of variables are, quite naturally, usually easier to interpret. It is clear, however, that with this 
additional goal, some of the explained variance has to be sacrificed. The objective of sparse principal 
component analysis (sparse PCA) is to find a reasonable trade-off between these conflicting goals. One 
would like to explain as much variability in the data as possible, using components constructed from as 
few variables as possible. This is the classical trade-off between statistical fidelity and interpretability. 



For about a decade, sparse PCA has been a topic of active research. Historically, the first suggested 
approaches were based on ad -hoc methods involving post-processing of the components obtained from 
classical PCA. For example, IJoUiffd 1199511 consi ders using various ro t ation techniques to find sparse 
loading vectors in the subspace identified by PCA. ICadima and JoUiffd il995n propose to simply set to 
zero the PCA loadings which are in absolute value smaller than some threshold constant. 

In recent years, more involved approaches have been put forward — approaches that consider the 
conflicting goals of explaining variability and achieving representation sparsity simultaneously. These 
methods usually cast the sparse PCA problem in the form of an optimization program, aiming at maxi- 
mizing explained varia nce penalized f or the number of non-zero loadings. For instance, the SCoTLASS 
algorithm proposed by IJoUiffe et all ll2003n aims at maximizing the Rayle i gh quo ti ent of the covari- 
ance matrix of the data under the £i-norm based Lasso penalty (|Tibshiranil 11199611 '). IZou et al. i2006ll 
formulate sparse PCA as a regression-type optim ization problem and impose the Lasso penalty on the 
regression coefficients. Id'Aspremont et al.l 1200711 in their PS PCA algorithm exploit co nvex optimization 
tools to solve a convex relaxation of the sparse PCA problem. IShen and HuangI 1200811 adapt the singular 
value decomposition (SVD) to compute low-rank matrix approximations of the data matrix under various 
sparsity-induci ng penalties. Greedy met hods, whi ch are typical for combina torial problems, have been 
investigated by iMoghaddam et al. i2006ll . Finally, Id'Aspremont et all 1200811 propose a greedy heuristic 
accompanied with a certificate of optimality. 



In many applications, several components need to be identified. The traditional approach consists of 
incorporating an existing single-unit algo rithm in a deflation scheme, and computing the desired num- 
ber of components sequentially (see, e.g.. Id'Aspremont et al.l 1200711 ). In the case of Rayleigh quotient 
maximization it is well-known that computing several components at once instead of computing them 
one-by-one by deflation with the classical power method might present better convergence whenever the 
largest eigenvalues of the underlying matrix are close to each other (see, e.g., Parlett 1 1980ll ). Therefore, 
block approaches for sparse PCA are expected to be more efficient on ill-posed problems. 



In this paper we consider two single-unit (Section 2.1 and 2.3) and two block formulations (Section 
2.3 and 2.4) of sparse PCA, aimed at extracting m sparse principal components, with m = 1 in the for- 
mer case and p > m > 1 in the latter. Each of these two groups comes in two variants, depending on the 
type of penalty we use to enforce spai^sity — either £i or £q (cardinality). Q While our basic formulations 

Our single-unit cardinality-penalized formulation is identical to that of Id'Aspremont et al. I I2OO8II . 
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involve maximization of a noncomex function on a space of dimension involving n, we construct refor- 
mulations that cast the problem into the form of maximization of a convex function on the unit Euclidean 
sphere in (in the m = \ case) or the Stiefel manifol^m RP><™ (in the ?n > 1 case). The advantage of 
the reformulation becomes apparent when trying to solve problems with many variables (n » p), since 
we manage to avoid searching a space of large dimension. At the same time, due to the convexity of the 
new cost function we are able to propose and analyze the iteration-complexity of a simple gradient-type 
scheme, which appears to be well suited for problems of this form. In particular, we study (Section |3]l a 
first-order method for solving an optimization problem of the form 

r = max/(x), (P) 

where Q is a compact subset of a finite-dimensional vector space and / is convex. It appears that our 
method has best theoretical convergence properties when either f or Q are strongly convex, which is the 
case in the single unit case (unit ball is strongly convex) and can be enforced in the block case by adding 
a strongly convex regularizing term to the objective function, constant on the feasible set. We do not, 
however, prove any results concerning the quality of the obtained solution. Even the goal of obtaining 
a local maximizer is in general unattainable, and we must be content with convergence to a stationary 
point. 

In the particular case when Q is the unit Euclidean ball in R-^ and f{x) = x^Cx for some p x p 
symmetric positive definite matrix C, our gradient scheme specializes to the power method, which aims 
at maximizing the Rayleigh quotient 

x'^Cx 
Rix) = — Tf. — 
x^ X 

and thus at computing the largest eigenvalue, and the con^esponding eigenvector, of C. 

By applying our general gradient scheme to our sparse PCA reformulations of the form (P), we obtain 
algorithms (Section 4) with per-iteration computational cost 0{npm). 

We demonstrate on random Gaussian (Section 5.1) and gene expression data related to breast cancer 
(Section 5.2) that our methods are very efficient in practice. While achieving a balance between the 
explained variance and sparsity which is the same as or superior to the existing methods, they are faster, 
often converging before some of the other algorithms manage to initialize. Additionally, in the case of 
gene expression data our approach seems to extract components with strongest biological content. 

Notation. For convenience of the reader, and at the expense of redundancy, some of the less standard 
notation below is also introduced at the appropriate place in the text where it is used. Parameters m < 
p < n aie actual values of dimensions of spaces used in the paper. In the definitions below, we use 
these actual values (i.e. n,p and m) if the corresponding object we define is used in the text exclusively 
with them; otherwise we make use of the dummy variables k (representing p or n in the text) and I 
(representing m, p or n in the text). 

We will work with vectors and matrices of various sizes (R^', R'^^'). Given a vector y G R'^, its i^^ 
coordinate is denoted by For a matrix Y G R^'^', is the i^^ column of Y and yij is the element of 
Y at position 



^Stiefel manifold is tiie set of rectangular matrices with orthonormal columns. 
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By E we refer to a finite-dimensional vector space; E* is its conjugate space, i.e. the space of all 
linear functionals on E. By {s, x) we denote the action of s G E* on x G E. For a self-adjoint positive 
definite linear operator G : E ^ E* we define a pair of norms on E and E* as follows 

= x G E, 

(2) 



def 



Although the theory in Section [3] is developed in this general setting, the sparse PC A applications 
considered in this paper require either the choice E = E* = (see Section [331 and problems ([8]l and 
(fT4l) in Section El) or E = E* = R^^" (see Section [Ml and problems ^ and ^ in Section El). In 
both cases we will let G be the corresponding identity operator for which we obtain 

/ xl/2 

{x,y) =^Xiyi, ||x|| = = ( y^x^ I = ||a;||2, x,yGRP,and 




{x,Y) =TrX^y, ||x|| = = Vx^. = \\x\\f, x,y G W"^. 



Thus in the vector setting we work with the standard Euclidean norm and in the matrix setting with 
the Frobenius norm. The symbol Tr denotes the trace of its argument. 

Furthermore, for z G R" we write ||z||i = \zi \ (£i norm) and by ||z||o (i?o "norm") we refer to the 
number of nonzero coefficients, or cardinality, of z. By we refer to the space of all p x p symmetric 
matrices; S!j_ (resp. S!j_^) refers to the positive semidefinite (resp. definite) cone. Eigenvalues of matrix 
Y are denoted by \i{Y), largest eigenvalue by X^axiX)- Analogous notation with the symbol a refers to 
singular values. 

By = {y ^ R'^ | y^y < 1} (resp. = {y G R'^ | y^y = 1}) we refer to the unit Euclidean 
ball (resp. sphere) in R^ . If we write B and S, then these are the corresponding objects in E. The space 
of n X m matrices with unit-norm columns will be denoted by 

[5"]" = {y G R"^'" I Diag(y^y) = /^}, 

where Diag(-) represents the diagonal matrix obtained by extracting the diagonal of the argument. Stiefel 
manifold is the set of rectangular matrices of fixed size with orthonormal columns: 

= {y G Rp^™ I y^y = 

For t G R we will further write sign(t) for the sign of the argument and ^4. = max{0, t}. 



2 Some formulations of the sparse PCA problem 

In this section we propose four formulations of the sparse PCA problem, all in the form of the general op- 
timization framework (P). The first two deal with the single-unit sparse PCA problem and the remaining 
two ai^e their generalizations to the block case. 
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2.1 Single-unit sparse PCA via £i -penalty 

Let us consider the optimization problem 

'Piiil) == max V z^Sz — 7||2;||i, (3) 

with sparsity-controlling parameter 7 > and sample covariance matrix S = A^A. 

The solution z*{'y) of Q in the case 7 = is equal to the right singular vector corresponding to 
(^ma\{A), the largest singular value of A. It is the first principal component of the data matrix A. The 
optimal value of the problem is thus equal to 

4>iA0) = {K..{A^A))'/^ = a„ax(^). 

Note that there is no reason to expect this vector to be sparse. On the other hand, for large enough 7, we 
wiU necessarily have z*{'y) = 0, obtaining maximal sparsity. Indeed, since 

max —rr—r, — = max < max — — j — = max ||aj||2 = ||aj* ||2, 

z^O \\z\\i z^O \\z\\i 2^0 * 

we get ||Az||2 — 7||z||i < for all nonzero vectors z whenever 7 is chosen to be strictly bigger than 
||ai* II2. From now on we will assume that 

7<l|oi-||2- (4) 

Note that there is a trade-off between the value 1 1 Az* (7) 1 1 2 and the sparsity of the solution z* (7) . The 
penalty parameter 7 is introduced to "continuously" interpolate between the two extreme cases described 
above, with values in the interval [0, ||aj* II2). It depends on the particular application whether sparsity is 
valued more than the explained variance, or vice versa, and to what extent. Due to these considerations, 
we will consider the solution of Q to be a sparse principal component of A. 

Reformulation. The reader will observe that the objective function in ^ is not convex, nor concave, 
and that the feasible set is of a high dimension if p ^ n. It turns out that these shortcomings are overcome 
by considering the following reformulation: 

(^^^(7) = max \\Az\\2 - l\\z\\i 

= max maxx"'"^^ — 7||2;||i (5) 



max max y' Zi(aTx) — 'ylzA 



x&BP z£B 

1=1 

n 



= max max >^ IzjldaJ'xl — 7), (6) 

xGBPzGB"^ 
i=l 

where Zi = sign{afx)zi. In view of dUl, there is some x £ B"' for which afx > 7. Fixing such x, 
solving the inner maximization problem for z and then translating back to z, we obtain the closed-form 
solution 

^.(^)^sign(af.)[|afx|-7]+^ ^ = l,...,n. (7) 

/ELi[l«M-7]i 
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Problem ^ can therefore be written in the form 




(8) 



Note that the objective function is differentiable and convex, and hence all local and global maxima must 
lie on the boundary, i.e., on the unit Euclidean sphere S^. Also, in the case when p <^ n, formulation dSjl 
requires to seaixh a space of a much lower dimension than the initial problem Q. 

Sparsity. In view of dV]), an optimal solution x* of dSjl defines a sparsity pattern of the vector z*. In 
fact, the coefficients of z* indexed by 



I = {i\ |a/x*| >7} 



(9) 



ai^e active while all others must be zero. Geometrically, active indices correspond to the defining hyper- 
planes of the polytope 

V = {xeW \ \afx\ < 1} 

that are (strictly) crossed by the line joining the origin and the point x* /'j. Note that it is possible to say 
something about the sparsity of the solution even without the knowledge of x*: 



7 > IWih 



<(7) = 0, 



1, . . . ,n. 



(10) 



2.2 Single-unit sparse PCA via cardinality penalty 

Instead of the ^i-penaUzation, Id'Aspremont et al. i2008ll consider the formulation 



(pioij) = maxz T.z-'y \\z\\o, 
which directly penalizes the number of nonzero components (cai^dinality) of the vector z. 
Reformulation. The reasoning of the previous section suggests the reformulation 

(pioil) = max max(x"^j4z)'^ — 7||2;||o, 

where the maximization with respect to z G B'"' for a fixed x ^ has the closed form solution 



(11) 



[sign((af x)2 - 7)] + af 3;_ 



\/Efc=i[sign((a^a;)^ - l)]+{alx) 
In analogy with the £i case, this derivation assumes that 



1 ,n. 



(12) 



(13) 



7 < ll'^' 



« 112) 



so that there is x G such that {aJxY — 7 > 0. Otherwise z* = is optimal. Formula ([T3] ) is easily 
obtained by analyzing ([T2l ) sepai^ately for fixed cardinality values of z. Hence, problem (ITTI) can be cast 
in the following form 

n 

" ^ " (14) 
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Again, the objective function is convex, albeit nonsmooth, and the new search sp ace is of particular 



interest if p <^ n. A different derivation of ([T4l ) for the n = p case can be found in Id'Aspremont et al 
ll2008l] . 



Sparsity. Given a solution x* of (fT4l ). the set of active indices of z* is given by 

I={i\{ajx*f>l}. 
Geometrically, active indices con^espond to the defining hyperplanes of the polytope 

V = {x£BF \ \ajx\ < 1} 
that ai^e (strictly) crossed by the line joining the origin and the point x*/^. As in the £i case, we have 



7 > IWil 



4(7) = 0, 



1, 



(15) 



2.3 Block sparse PCA via -penalty 

Consider the following block generalization of dU, 



</'£i,m(7)= max Tr{X^AZN) 



m n 

j=i i=i 



(16) 



where 7 > is a sparsity-controlling parameter and N = Diag(/ii, . . . , Hm), with positive entries on 
the diagonal. The dimension m con^esponds to the number of extracted components and is assumed to 
be smaller or equal to the rank of the data matrix, i.e., m < Rank(A). It will be shown below that under 
some conditions on the parameters /Uj, the case 7 = recovers PCA. In that particular instance, any 
solution Z* of ([T6l ) has orthonormal columns, although this is not explicitly enforced. For positive 7, 
the columns of Z* are not expected to be orthogona l anymore. Most existing a l gorith r ns for computing 
severa l sparse principal components, e.g.. lZou et all ll2006ll . Id'Aspremont et al.l 1200711 . IShen and Huang 
120081], also do not impose orthogonal loading directions. Simultaneously enforcing sparsity and orthog- 
onality seems to be a hard (and perhaps questionable) task. 

Reformulation. Since problem (fT6l ) is completely decoupled in the columns of Z, i.e.. 



4'ei,m{'y) = max ^ max fijxjAzj - 7ll^illi, 



3=1 



the closed-form solution (|7]) of ([Sll is easily adapted to the block formulation ([T6l ): 

sign{afxj)[fij\ajxj\ -7] + 



4(7) 



(17) 



This leads to the reformulation 



(7) = max y^[M} 



Xj\ 



7]l, 



j=i i=i 



(18) 
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which maximizes a convex function / : RP^™ ^ R on the Stiefel manifold 5m- 

Sparsity. A solution X* of (fTSl ) again defines the sparsity pattern of the matrix Z*: the entry z*j is 
active if 

fij\ajx*\ > 7, 

and equal to zero otherwise. For 7 > maxj j /ij||aj||2, the trivial solution Z* = is optimal. 
Block PCA. For 7 = 0, problem ([TSl l can be equivalently written in the form 

(pj^^rni^) = max Tr{X'^ AA^XN^), (19) 

X^Sm 



which has been well studied (see e.g., iBrocketti 11199 ih and lAbsil et all ll2008h ). The solutions of ([T9l ) 



span the dominant m-dimensional invariant subspace of the matrix AA^. Furthermore, if the parameters 
fii are all distinct, the columns of X* are the m dominant eigenvectors of AA'^, i.e., the m dominant left- 
eigenvectors of the data matrix A. The columns of the solution Z* of ([T6l ) are thus the m dominant right 
singular vectors of A, i.e., the PCA loading vectors. Such a matrix N with distinct diagonal elements 
enforces the objective function in ([T9l ) to have isolated maximizers. In fact, if = Im, any point X*U 
with X* a solution of ([T9l ) and U G 5™ is also a solution of ([T9l ). In the case of sparse PCA, i.e., 7 > 0, 
the penalty term enforces isolated maximizers. The technical parameter A'^ will thus be set to the identity 
matrix in what follows. 

2.4 Block sparse PCA via cardinality penalty 

The single-unit cai'dinality-penalized case can also be naturally extended to the block case: 

<Pe,,mh)= max TT{Dmg{X'' AZNf) - -f\\Z\\o, (20) 

where 7 > is the sparsity inducing parameter and N = Diag(/ii, . . . , fim) with positive entries on 
the diagonal. In the case 7 = 0, problem (l22l l is equivalent to ( fT9l ) and therefore corresponds to PCA, 
provided that all ^li are distinct. 

Reformulation. Again, this block formulation is completely decoupled in the columns of Z, 

m 

^io,m{l) = max xn&yi{njx^ Azjf - 7||^j||o, 

so that the solution (fT3] ) of the single unit case provides the optimal columns Zi : 

4 = 4.(7) = , [^ign((A^.«fx,)^ - 7)]+/.,af X, ^^^^ 



(22) 



The reformulation of problem (I20b is thus 



,(7) = max ^^[(^j-afxj)^ -7]- 
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which maximizes a convex function / : RP^™ ^ R on the Stiefel manifold 5m- 

Sparsity. For a solution X* of (l22l ). the active entries z*,j of Z* are given by the condition 

(H^iXj) >7- 

Hence for 7 > max ^Uj ||aj|| 2, the optimal solution of (|20l ) is Z* = 0. 

3 A gradient method for maximizing convex functions 

By E we denote an ai^bitraiy finite-dimensional vector space; E* is its conjugate, i.e. the space of all 
lineal- functionals on E. We equip these spaces with norms given by Q. 

In this section we propose and analyze a simple gradient-type method for maximizing a convex 
function / : E ^ R on a compact set Q: 



f* = max/(x). 

xeQ 



(P) 



Unless explicitly stated otherwise, we will not assume / to be differentiable. By f'{x) we denote 
any subgradient of function / at x. By df{x) we denote its subdifferential. 

At any point x G Q we introduce some measure for the first-order optimality conditions: 

A(x) =^ max(/'(x), y — x). 
y&Q 

Clearly, A(x) > and it vanishes only at the points where the gradient /'(x) belongs to the normal cone 
to the set Conv(Q) at xH 

We will use the following notation: 

def 

y{x) G Argmax(/ (x), y — x). (23) 
y&Q 

3.1 Algorithm 

Consider the following simple algorithmic scheme. 

def 

Note that for example in the special case Q = r- 5 = r- {xGE| ||x||=r}or 
Q = r- ;S==r-{xGE| ||x|| < r}, the main step of Algorithm[T]can be written in an explicit form: 

G-^nxu) ,,,, 

y{.Xk) = Xk+i = r . (24) 

\\f {xk)\U 



^ The normal cone to the set Conv(Q) at 2: £ Q is smaller than the normal cone to the set Q. Therefore, the optimality 
condition A (a;) = is stronger than the standard one. 
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Algorithm 1: Gradient scheme 



input : Initial iterate xq G E. 

output: Xfc, approximate solution of (P) 

begin 

k< — 

repeat 

Xfc+i G Argmax{/(xfc) + {f'{xk),y- Xk) | y G Q} 
ki — k+l 
until a stopping criterion is satisfied 

end 



3.2 Analysis 

def 

Our first convergence result is Straightforward. Denote = min A(xi). 

0<i<fc 

Theorem 1 Let sequence {xk}^=Q be generated by Algorithm\J\as applied to a convex fimction f. Then 
the sequence {/(a;fc)}^o monotonically increasing and lim /S.{xk) = 0. Moreover, 

fc— ►oo 

A, < (25) 

K + 1 

Proof. From convexity of / we immediately get 

f{Xk+l) > f{Xk) + ifiXk) ) Xk+1 — Xk) — f{xk) + A(xfc), 

and therefore, f{xk+i) > f{xk) for all k. By summing up these inequalities for A; = 0, 1, . . . , — 1, 
we obtain 

fc 

/* - /(xo) > f{xk) - f{xo) > ^(^^)' 

i=0 

and the result follows. □ 
For a shaiper analysis, we need some technical assumptions on / and Q. 

Assumption 1 The norms of the subgradients of f are bounded from below on Q by a positive constant, 
i.e. 

5/=^ min ||/'(x)||, >0. (26) 
This assumption is not too binding because of the following result. 

Proposition 2 Assume that there exists a point x ^ Q such that f{x) < f{x)for all x £ Q. Then 



5f> 



min/(x) - f{x) 



I 



max X — x\ 



> 0. 



Proof. Because / is convex, for any x G Q we have 

< fix) - fix) < (/'(x),x - x) < ||/'(x)||, • ||x - x||. 

□ 

For our next convergence result we need to assume either strong convexity of / or strong convexity 
of the set Conv(Q). 
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Assumption 2 Function f is strongly convex, i.e. there exists a constant af > such that for any 
x,y gF, 

fiy) > fix) + {fix),y- x) + ^\\y- x\\\ {11) 

Convex functions satisfy this inequality for convexity parameter aj = 0. 

Assumption 3 The set Conv(Q) is strongly convex. This means that there exists a constant ctq > 
such that for any x,y G Conv(Q) and a £ [0, 1] the following inclusion holds: 

ax + (1 - a)y + ^"(1 - a)\\x - yf ■ S C Conv(Q). (28) 

Convex sets satisfy this inclusion for convexity parameter cjq = 0. It can be shown (see Appendix), 
that level sets of strongly convex functions with Lipschitz continuous gradient are again strongly convex. 
An example of such a function is the simple quadratic x ^ \\xf. The level sets of this function 
correspond to Euclidean balls of varying sizes. 

As we will see in TheoremlH a better analysis of Algorithm [T] is possible if Conv(Q), the convex 
hull of the feasible set of problem (P), is strongly convex. Note that in the case of the two formulations 
dHll and (fT4l) of the spai^se PCA problem, the feasible set Q is the unit Euclidean sphere. Since the convex 
hull of the unit sphere is the unit ball, which is a strongly convex set, the feasible set of our sparse PCA 
formulations satisfies Assumption |3] 

In the special case Q = r ■ S for some r > 0, there is a simple proof that Assumption |3]holds with 
(7q = Indeed, for any x, y G E and a G [0, 1], we have 

||ax + (1 - a)y||2 = a^\\xf + {l-a)'^\\yf + 2a{l-a){Gx,y) 

= a||x|p + (1 — — a(l — Ci)\\x — yf. 

Thus, for X, y G r • 5 we obtain: 

||ax + (1 — ct)y\\ = [r^ 

Hence, we can take cq = ^. 

The relevance of Assumption [3] is justified by the following technical observation. 
Proposition 3 Let Assumption\3\be satisfied. Then for any x £ Q the following holds: 

A(x)>^||/'(x)||,.||y(x)-xf. (29) 

Proof. Fix an arbitrary x G Q. Note that 

(/'(x),2/(x)-y) >0, yGConv(Q). 

We will use this inequality for 

y = ya = x + a{y{x) - x) + ^"(1 - oi)\\y{x) - xf ■ JTTT^yp' « ^ [0' 



- a(l - a)||x - yf]^'"^ <r - — a(l - a)||x - y 
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In view of Assumption [3l ya G Conv(Q). Therefore, 



> - y{x)) = (1 - a)(/'(x),x - y{x)) + ^a(l - a)\\y{x) - xf ■ 

Since a is an ai^bitrary value from [0, 1], the result follows. □ 

We are now ready to refine our analysis of Algorithm 1. 

Theorem 4 (Convergence) Let f be convex and let Assumption\J}and at least one of Assumptions^and 
\3]be satisfied. Ifixf^} is the sequence of points generated by Algorithm 1, then 

f; 11,,^, (30) 

Proof. Indeed, in view of our assumptions and Proposition [3l we have 

/(xfc+i) - f{xk) > A{xk) + ^\\xk+i - XkW'^ > ^{(JQ6f + fT/)||xfe+i - Xfclp. 

□ 

We cannot in general guarantee that the algorithm will converge to a unique local maximizer. In 
particular, if started from a local minimizer, the method will not move away from this point. However, 
the above statement guarantees that the set of its limit points is connected and all of them satisfy the 
first-order optimality condition. 

3.3 Maximization with spherical constraints 

Consider E = E* = RP with G = Ip and (s, x) = J2i ^i^i' ^^'^ 

Q = r-SP = {xeRP\ \\x\\ =r}. 

Problem (P) takes on the form: 



/* = max /(x). 



Since Q is strongly convex (ag = ^), Theorem|4]is meaningful for any convex function / (aj > 0). We 
have akeady noted (see (l24l) ) that the main step of Algorithm[T]can be written down explicitly. Note that 
the single-unit sparse PCA formulations ([H) and ([T4l) conform to this setting. The following examples 
illustrate the connection to classical algorithms. 

Example 5 (Power method) In the special case of a quadratic objective function f{x) = i^x'^Cxfor 
some C £ S^_|_ on the unit sphere (r = 1), we have 

f* = ^Knax{C), 

and Algorithm \J\ is equivalen t to the power iteration method for computing the largest eigenvalue of C 



eq uivalen t 

\Golub and Van Loan 1 199^1 ). Hence for Q = S^, we can think of our scheme as a generalization of the 



power method. Indeed, our algorithm performs the following iteration: 

Cxk 
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Note that both 5f and af are equal to the smallest eigenvalue ofC, and hence the right-hand side of diOD 
is equal to 

Example 6 (Shifted power method) If C is not positive semidefinite in the previous example, the ob- 
jective function is not convex and our results are not applicable. However, this complication can be 
circumvented by instead running the algorithm with the shifted quadratic function 

f[x) = ^x^{C + u;Ip)x, 

where w > satisfies C = ujlp + C G On the feasible set, this change only adds a constant term 

to the objective function. The method, however, produces different sequence of iterates. Note that the 
constants 6f and af are also affected and, correspondingly, the estimate l \31i . 



3.4 Maximization with orthonormality constraints 

Consider E = E* = R^^"*, the space of p x m real matrices, with m < p. Note that for m = 1 
we recover the setting of the previous section. We assume this space is equipped with the trace inner 

product: {X,Y) = Tr(X^y). The induced norm, denoted by \\X\\f = {X,Xy/'^, is the Frobenius 
norm (we let G be the identity operator). We can now consider various feasible sets, the simplest being 
a ball or a sphere. Due to nature of applications in this paper, let us concentrate on the situation when Q 
is a special subset of the sphere with radius r = ^/m, the Stiefel manifold Sm- 

Q = SP, = {Xe RP^" I X^X = I^}. 

Problem (P) then takes on the following form: 



/* = max f{X). 

X£Sm. 



Note that Conv(Q) is not strongly convex (cjg = 0), and hence Theorem |4] is meaningful only if / is 
strongly convex (fjj > 0). At every iteration, the algorithm needs to maximize a linear- function over the 
Stiefel manifold. The following standard result shows how this can be done. 

Proposition 7 Let C € R^^™, with m < p, and denote by (Ti{C), i = 1, . . . ,m, the singular values of 
C. Then 

m 

max (C, X) = Tr[(C7^C)^/2] ^ ^.^c), (32) 

and a maximizer X* is given by the U factor in the polar decomposition ofC: 

C = UP, Ue SP„ P G S"^. 
If C is of full rank, then we can take X* = C{C^C)~^^'^. 
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Proof. Existence of the polar factorization in the nonsquare case is covered by Theorem 7.3.2 in lHom and Johnson 
1 1985n . Let C = VTW'^ be the singular- value decomposition of A; that is, F is p x p orthonormal, W 
ism X m orthonormal, and S is p x m diagonal with values (Ti{A) on the diagonal. Then 



max (C, X) = max {VY.W^ , X) 



T 

T vT^ 



max TtY.{W'X' V) 



max TrSZ"^ = max ai{C)zii < ^^crj(C). 



The third equality follows since the function X h-> V^XW maps Sm onto itself. It remains to note 

that 

i i i 

Finally, in the full rank case we have (C, X*) = Tr C^C{C^Cy^/^ = Tr(C^C)i/2 

□ 

In the sequel, the symbol Uf (C) will be used to denote the U factor of the polar decomposition of 
matrix C G W"", or equivalendy, Uf (C) = C(C^C)-i/2 jf q of full rank. In view of the above 
result, the main step of Algorithm [J can be written in the form 

xk+i = Uf(/'(xfc)). (33) 



Note that the block sparse PCA formulations (1181 ) and (1221 ) conform to this setting. Here is one more 
example: 

Example 8 (Rectangular Procrustes Problem) Let C,X G rpx™ and D G Rp^p and consider the 
following problem: 

m\n{\\C -DXfp\X'^X = Im}. (34) 

Since \\C — DX\\'^p = \\C\^p + {DX, DX) —2{CD, X), by a similar shifting technique as in the previous 
example we can cast problem ( 1^41 ) in the following form 

max{u;||X|||, - {DX, DX) + 2{CD, X) \ X^X = I^}. 

For to > large enough, the new objective function will be strongly con vex. In this case our algorithm 
becomes similar to the gradient method proposed bv lFraikin et al. 

The standard Procrustes problem in the literature is a special case ofH34\l with p = m. 



4 Algorithms for sparse PCA 

The application of our general method (Algorithm [T]) to the four sparse PCA formulations of Section 
El i.e., dH), (O, (O and ([22]), leads to Algorithms [2l[3l Hand [5] below, that provide a locally optimal 
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pattern of sparsity for a matrix Z G [^'^J'^Q This pattern is defined as a matrix P S R"^^™ such that 
Pij = if the loading Zij is active and = 1 otherwise. So P is an indicator of the coefficients of Z 
that ai^e zeroed by our method. The computational complexity of the single-unit algorithms (Algorithms 
mandO is 0{np) operations per iteration. The block algorithms (Algorithms!?] and ID have complexity 
O {npm) per iteration. 

4.1 Methods for pattern- finding 



Algorithm 2: Single-unit sparse PCA method based on the ^i-penalty ^ 
input : Data matrix ^4 G R^^" 

Sparsity-controlling parameter 7 > 
Initial iterate x ^ 
output: A locally optimal sparsity pattern P 
begin 
repeat 

X i — YTi=i[\o-jA - l\+sign{ajx)ai 
until a stopping criterion is satisfied 

Pi = if \ajx\ > 7 



Construct vector P £ R" such that 
end 



Pi = I otherwise. 



Algorithm 3: Single-unit spai^se PCA algorithm based on the £o-penalty ([141 ) 

input : Data matrix ^ G R^^" 

Sparsity-controlling parameter 7 > 
Initial iterate x £ 
output: A locally optimal sparsity pattern P 
begin 
repeat 

X < — Er=i[sign((af 2;)^ - 7)]+ afx ai 
until a stopping criterion is satisfied 

Pi = if {ajxf > 7 



Construct vector P G R" such that 
end 



Pi = 1 otherwise. 



This section discusses the general block sparse PCA problem. The single-unit case corresponds to the particular case 
m — 1. 
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Algorithm 4: Block Spai-se PCA algorithm based on the £i-penalty ([TSl l 



input : Data matrix A G RP^" 

Sparsity-controlling parameter 7 > 
Initial iterate X G Sm 
output: A locally optimal spai^sity pattern P 
begin 
repeat 

for j = 1, ... ,m do 
L ' — Er=i[l«f^jl -'y]+sign{afx)a^ 
X < — Uf (X) 
until a stopping criterion is satisfied 

Pij = if \afxj\ > 



Construct matrix P G R"^™ such that 
end 



Pij = 1 otherwise. 



Algorithm 5: Block Spai^se PCA algorithm based on the £o-penalty 



input : Data matrix ^4 G R^^" 

Sparsity-controlling parameter 7 > 
Initial iterate X G Sm 
output: A locally optimal sparsity pattern P 
begin 
repeat 

for j = 1, ... ,m do 
L ' — Er=i[sign((af 2:^)2 - 7)]+ afxj ai 
X < — Uf (X) 
until a stopping criterion is satisfied 

Pij =0 if [ajxjf > 7 



Construct matrix P G R"^™ such that 
end 



Pij = 1 otherwise. 



4.2 Post-processing 

Once a "good" sparsity pattern P has been identified, the active entries of Z still have to be filled. To 
this end, we consider the optimization problem, 

{X*,Z*) = arg max Tr(X^AZA^), (35) 

Zp=0 

where Zp denotes the entries of Z that are constrained to zero and N = Diag(//i, . . . , Hm) with strictly 
positive /ij. Problem ( [35l ) assigns the active part of the loading vectors Z to maximize the variance 
explained by the resulting components. By Zp, we refer to the complement of Zp, i.e., to the active 
entries of Z. In the single-unit case m = 1, an explicit solution of (l35l) is available, 

X* =u, 

Z*p =v and Z*p = 0, ^ ' 
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where auv"^ with a > 0, u ^ and v S is a rank one singular value decomposition of the matrix 

Ap, that coiTesponds to the submatrix of A containing the columns related to the active entries. 

Although an exact solution of ( [35l ) is hard to compute in the block case m > 1, a local maximizer 
can be efficiently computed by optimizing alternatively with respect to one variable while keeping the 
other ones fixed. The following lemmas provide an explicit solution to each of these subproblems. 

Lemma 9 For a fixed Z G [5"]™, a solution X* of 

max Tt{X^AZN) 

is provided by the U factor of the polar decomposition of the product AZN. 

Proof. See Proposition |7] □ 
Lemma 10 The solution 

Z* = aTg max TriX'^AZN), (37) 

Zp=0 

is at any point X E Sm defined by the two conditions Zp = [A^ XND)p and Zp = 0, where D is a 
positive diagonal matrix that normalizes each column of Z* to unit norm, i.e., 

D = V)-\ag{NX'^AA^XN)-^. 

Proof. The Lagrangian of the optimization problem ( [37] ) is 

Ai, As) = Ti{X^AZN) - Tr(Ai(Z^Z - I^)) - Tr(A^Z), 

where the Lagrangian multipliers Ai G ]^"ix™ and A2 G pt^x™ have the following properties: Ai is an 
invertible diagonal matrix and (A2)p = 0. The first order optimality conditions of (l37l) are thus 

A^XN - 2ZAi - A2 = 
Diag(Z^Z) = Lm 
Zp = 0. 

Hence, any stationary point Z* of ([37]) satisfies Z*p = (A^XND) p and Zp = 0, where D is a diagonal 
matrix that normalizes the columns of Z* to unit norm. The second order optimality condition imposes 
the diagonal matrix D to be positive. Such a D is unique and given hy D = Diag{N X^ AA^ X N)^ 2 . □ 



The alternating optimization scheme is summarized in Algorithm [6l which computes a local solution 
of (l35l) . It should be noted that Algorithm [6] is a postprocessing heuristic that, strictly speaking, is 
required only for the £1 block formulation (Algorithm |4l). In fact, since the cardinality penalty only 
depends on the spai^sity pattern P and not on the actual values assigned to Zp, a solution {X*, Z*) of 
Algorithms [3] or [5] is also a local maximizer of (l35l) for the resulting pattern P. This explicit solution 
provides a good alternative to Algorithm [6l In the single unit case with ii penalty (Algorithm |2l), the 
solution (l36l ) is available. 
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Algorithm 6: Alternating optimization scheme for solving (1351 ) 



input : Data matrix A G RPX" 

Sparsity pattern P G R"^^"" 
Matrix N = Diag(/xi, . . . , fim) 
Initial iterate X e Sm 

output: A local minimizer {X, Z) of ([35]) 

begin 
repeat 

Z < — A^XN 
Z i — Z Diag(Z^Z)-5 
Zpi — 
X < — m{AZN) 
until a stopping criterion is satisfied 





Computation of P 


Computation of Zp 


GPowerc^ 


Algorithm|2] 


Equation 


GPower^o 


Algorithm[3] 


Equation (fT3]l 


GPower^i,™ 


Algorithmll] 


Algorithm|6] 


GPowerfo,™ 


Algorithm|5] 


Equation dzB 



Table 1 : New algorithms for sparse PCA. 



4.3 Sparse PCA algorithms 

To sum up, in this paper we propose four sparse PCA algorithms, each combining a method to identify 
a "good" sparsity pattern with a method to fill the active entries of the m loading vectors. They are 
summarized in Table [TH 



4.4 Deflation scheme. 



For the sake of completeness, we recall a cl assical deflation process f or computing m sparse princi- 
pal components with a single-unit algorithm (|d'Aspremont et alJ 120071]). Let z G R" be a unit-norm 
sparse loading vector of the data A. Subsequent directions can be sequentially obtained by computing a 



dominant sparse component of the residual matrix A 



xz 



where x = Az is the vector that solves 



mill \\A — xz 



5 Numerical experiments 

In this section, we evaluate the proposed power algorithms against existing sparse PCA methods. Three 
competing methods are considered in this study: a greedy scheme aimed at computing a local maximizer 

^Our algorithms are named GPower where the "G" stands for generalized or gradient. 



18 



of ([TT]) dd'Aspremont et all i2008h . the SPCA algorithm dZou et all ll2006h ) and the sPCA-rSVP algo - 
rithm dShen and HuangI jlOOSh ). We do not include the DSPCA algorithm dd'Aspremont et all IiOOtIi ) 
in our numerical study. This method solves a convex relaxation of the sparse PCA problem and has a 
large computational complexity of 0{n^) compared to the other methods. Table |2] lists the considered 
algorithms. 

GPower^j^ Single-unit sparse PCA via £i-penalty 
GPowerff, Single-unit sparse PCA via ^o-penalty 
GPower^j^m Block sparse PCA via £i-penalty 
GPower£„.m Block sparse PCA via £o-penalty 
Greedy Greedy method 

SPCA SPCA algorithm 

rSVDf J sPCA-rSVD algorithm with an £i -penalty ("soft thresholding") 

rSVDff, sPCA-rSVD algorithm with an ^p-penalty ("hard thresholding") 



Table 2: Sparse PCA algorithms we compare in this section. 



These algorithms are compared on random data (Section [5?T] ) as well as on real data (Section [S!2l ). 
All numerical experiments are performed in MATLAB. Our implementations of the G Power algorithms 
are initialized at a point for which the associated sparsity pattern has at least one active element. In case 
of the single-unit algorithms, such an initial iterate x G 5^ is chosen parallel to the column of A with the 
largest norm, i.e., 

X = -r. — ^-fp, where i* = argmax ||oi||2. (38) 
||aj*||2 « 

For the block G Power algorithms, a suitable initial iterate X E Sm is constructed in a block- wise manner 

a& X = [x\X^], where x is the unit-norm vector (1381 ) and Xx_ G S^_i is orthogonal to x, i.e., x'^ X±_ = 

0. We stop the G Power algorithms once the relative change of the objective function is small: 



/(Xfc+i) - f{Xk) 
f{Xk) 



< € 
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M ATLAB irnplemen tati qns of the SPCA algorithrn and the greedy algorithm have been rendered available 
by lZou et al.l 120061] and ld'Aspremont et all i2008[]. We have , however, implemented the sPCA-rSVD al- 
gorithm on our own (Algorithm 1 in lshen and Huang 1 2008 1). and use it with the same stopping criterion 
as for the G Power algorithms. This algorithm initializes with the best rank-one approximation of the data 
matrix. This is done with the svds function in MATLAB. 

Given a data matrix A G R*'^", the considered sparse PCA algorithms provide m unit-norm spai^se 
loading vectors stored in the matrix Z G [5"]™ . The samples of the associated components are provided 
by the m columns of the product AZ. The variance explained by these m components is an impor- 
tant comparison criterion of the algorithms. In the simple case m = 1, the variance explained by the 
component Az is 

Var(z) = z^A'^Az. 

When z corresponds to the first principal loading vector, the variance is Var(z) — f^max 

{Af. In the case 

m > 1, the derived components are likely to be correlated. Hence, summing up the variance explained 
individually by each of the components overestimates the variance expl ained simultaneo usly by all the 
components. This motivates the notion of adjusted variance proposed by lZou et all ll2006ll . The adjusted 
variance of the m components Y = AZ is defined as 

AdjVar Z = TvR^, 
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where Y = QR is the QR decomposition of the components sample matrix Y (Q Sm and R is an 
m X m upper triangular- matrix). 

5.1 Random test problems 

All random data matrices A G rpx" considered in this section are generated according to a Gaussian 
distribution, with zero mean and unit variance. 

Trade-off curves. Let us first compare the single-unit algorithms, which provide a unit-norm sparse 
loading vector z G R". We first plot the variance explained by the extracted component against the 
cardinality of the resulting loading vector z. For each algorithm, the sparsity-inducing parameter is 
incrementally increased to obtain loading vectors z with a cai^dinality that decreases from ?i to 1. The 
results displayed in Figure [J are averages of computations on 100 random matrices with dimensions p = 
100 and n = 300. The considered sparse PCA methods aggregate in two groups: GPower^j, GPower^g, 
Greedy and rSVD^,j outperform the SPCA and the rSVD^j approaches. It seems that these latter methods 
perform worse because of the £i penalty term used in them. If one, however, post-processes the active 
part of z according to (l36l ). as we do in GPower^^, all sparse PCA methods reach the same performance. 




Cardinality 

Figure 1 : Trade-off curves between explained variance and cardinality. The vertical axis is the ra- 
tio Var(zsPCA)/ Var(2pcA), where the loading vector ^sPCA is computed by sparse PCA and zpcA 
is the first principal loading vector. The considered algorithms aggregate in two groups: GPowerfj, 
GPower^Q, Greedy and rSVO^^ (top curve), and SPCA and rSVD^^ (bottom curve). For a fixed car- 
dinality value, the methods of the first group explain more variance. Postprocessing algorithms SPCA 
and rSVD^j with equation l |36t . results, however, in the same performance as the other algorithms. 

Controlling sparsity with 7. Among the considered methods, the greedy approach is the only one to 
directly control the cardinality of the solution, i.e., the desired cardinality is an input of the algorithm. The 
other methods require a parameter controlling the trade-off between variance and cardinality. Increasing 
this parameter leads to solutions with smaller cardinality, but the resulting number of nonzero elements 
can not be precisely predicted. In Figure |2j we plot the average relationship between the parameter 7 and 
the resulting cardinality of the loading vector z for the two algorithms GPower^j and GPower^g. In view 
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of (ITOl ) (resp. (fT5])). the entries i of the loading vector z obtained by the GPower^^ algorithm (resp. the 
GPower^Q algorithm) satisfying 

hih < 7 (resp. Iloilla < 7) (39) 

have to be zero. Taking into account the distribution of the norms of the columns of A, this provides for 
every 7 a theoretical upper bound on the expected cardinaUty of the resulting vector z. 




Figure 2: Dependence of cardinality on the value of the sparsity-inducing parameter 7. In case of the 
G Powers J algorithm, the horizontal axis shows 7/||ai* ||2, whereas for the GPower£„ algorithm, we use 
y/y/\\<ii' Il2- The theoretical upper bound is therefor identical for both methods. The plots are averages 
based on 100 test problems of size p = 100 and n = 300. 



Greedy versus the rest. The considered sparse PCA methods feature different empirical compu- 
tational complexities. In Figure |3l we display the average time required by the sparse PCA algorithms 
to extract one sparse component from Gaussian matrices of dimensions p = 100 and n = 300. One 
immediately notices that the greedy method slows down significantly as cardinality increases, whereas 
the speed of the other considered algorithms does not depend on cardinality. Since on average Greedy is 
much slower than the other methods, even for low cardinalities, we discard it from all following numeri- 
cal experiments. 

Speed and scaling test. In Tables [3] and |4] we compare the speed of the remaining algorithms. Table 
[3] deals with problems with a fixed aspect ratio n/p = 10, whereas in Tabled p is fixed at 500, and 
exponentially increasing values of n are considered. For the GPower^j method, the sparsity inducing 
parameter 7 was set to 10% of the upper bound 7max = Iki'lb- For the GPower^y method, 7 was set 
to 1% of 

7max — lloj'lll in order to aim for solutions of comparable cardinalities (see ( 1391 )). These 
two parameters have also been used for the rSVD^^ and the rSVD^g methods, respectively. Concerning 
SPCA, the sparsity parameter has been chosen by trial and error to get, on average, solutions with similar 
cardinalities as obtained by the other methods. The values displayed in Tables [3] and |4] correspond to the 
average running times of the algorithms on 100 test instances for each problem size. In both tables, the 
new methods GPower^^ and GPower^g are the fastest. The difference in speed between GPower^j and 
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Cardinality 

Figure 3: The computational complexity of Greedy grows significantly if it is set out to output a 
loading vector of increasing cardinality. The speed of the other methods is unaffected by the cardinality 
target. 

GPower^Q results from different approaches to fill the active part of z: GPower^^ requires to compute a 
rank-one approximation of a submatrix of A (see Equation (l36l)). whereas the explicit solution ([T3l) is 
available to GPower^g. The Unear complexity of the algorithms in the problem size n is clearly visible in 
Table H 



p X n 


100 X 1000 


250 X 2500 


500 X 5000 


750 X 7500 


1000 X 10000 


GPower^i 


0.10 


0.86 


2.45 


4.28 


5.86 


GPower^Q 


0.03 


0.42 


1.21 


2.07 


2.85 


SPCA 


0.24 


2.92 


14.5 


40.7 


82.2 


rSVDf, 


0.21 


1.45 


6.70 


17.9 


39.7 


rSVD,„ 


0.20 


1.33 


6.06 


15.7 


35.2 


Table 3: 


Average computational time for the extraction of 


one component (in seconds). 


J) X n 


500 X 1000 


500 X 2000 


500 X 4000 


500 X 8000 


500 X 16000 


GPower^j 


0.42 


0.92 


2.00 


4.00 


8.54 


GPower^i, 


0.18 


0.42 


0.96 


2.14 


4.55 


SPCA 


5.20 


7.20 


12.0 


22.6 


44.7 


rSVD^, 


1.20 


2.53 


5.33 


11.3 


26.7 


rSVD,„ 


1.09 


2.26 


4.85 


10.5 


24.6 



Table 4: Average computational time for the extraction of one component (in seconds). 



Different convergence mechanisms. Figured illustrates how the trade-off between explained vari- 
ance and sparsity evolves in the time of computation for the two methods GPower^^ and rSVD^^. In case 
of the GPower^j algorithm, the initialization point (1381 ) provides a good approximation of the final car- 
dinality. This method then works on maximizing the variance while keeping the sparsity at a low level 
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throughout. The rSVD^^ algorithm, in contrast, works in two steps. First, it maximizes the variance, 
without enforcing sparsity. This coiTcsponds to computing the first principal component and requires 
thus a first run of the algorithm with random initialization and a sparsity inducing parameter set at zero. 
In the second run, this parameter is set to a positive value and the method works to rapidly decrease car- 
dinality at the expense of only a modest decrease in explained variance. So, the new algorithm GPower^^ 
performs faster primarily because it combines the two phases into one, simultaneously optimizing the 
trade-off between variance and sparsity. 




Figure 4: Evolution of the variance (solid lines and left axis) and cardinality (dashed lines and right 
axis) in time of computation for the methods GPowerf ^ and rSVD^j on a test problem with p = 250 
and n — 2500. The vertical axis is the ratio Var(2sPCA)/ Var(zpcA), where the loading vector Zspca 
is computed by sparse PCA and zpcA is the first principal loading vector. The rSVDf ^ algorithm first 
solves unconstrained PCA, whereas GPowerf immediately optimizes the trade-off between variance 
and sparsity. 



Extracting more components. Similar numerical experiments, which include the methods GPower^j , 
and GPower^p have been conducted for the extraction of more than one component. A deflation 
scheme is used by the non-block methods to sequentially compute m components. These experiments 
lead to similar conclusions as in the single-unit case, i.e, the methods GPower^j, GPower^g, GPower^j 
GPower^,, .m and rSVD^,, outperform the SPCA and rSVD^j approaches in terms of variance explained at 
a fixed cardinality. Again, these last two methods can be improved by postprocessing the resulting load- 
ing vectors with Algorithm^ as it is done for GPower^^ m. The average running times for problems of 
various sizes are listed in Table [5] The new power-like methods are significantly faster on all instances. 
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p X n 


50 X 500 


100 X 1000 


250 X 2500 


500 X 5000 


750 X 7500 


GPower^j 


0.22 


0.56 


4.62 


12.6 


20.4 


GPower^f, 


0.06 


0.17 


2.15 


6.16 


10.3 


GPower£i^,„ 


0.09 


0.28 


3.50 


12.4 


23.0 


GPower£„.,„ 


0.05 


0.14 


2.39 


7.7 


12.4 


SPCA 


0.61 


1.47 


13.4 


48.3 


113.3 


rSVDf, 


0.30 


1.15 


7.92 


37.4 


97.4 


rSVDf„ 


0.28 


1.10 


7.54 


34.7 


85.7 



Table 5: Average computational time for the extraction of m = 5 components (in seconds). 



5.2 Analysis of gene expression data 



Gene expression data results from DNA microarrays and provide the expression level of thousands of 
genes across several hundreds of experiments. The interpretation of these huge databases remains a 
challenge. Of particular interest is the identific ation of genes tha t are systematically coexpressed un- 
der similar experimental conditions. We refer to iRiva et all i2005n and references therein for more de- 
tails on microarra ys and gene expression data. PC A has been intensively applied in this context (e.g., 
Alter et all 1200311 1. F urther methods for dimension reducti on, such as indepen dent component analysis 
(ILiebermeisteii 1200211 ) or nonnegative matrix factorization (IBrunet et al.l 1200411 ). have also been used on 
gene expression data. Sparse PCA, which extracts components involving a few genes only, is expected 
to enhance interpretation. 



Data sets. The results below focus on four major data sets related to breast cancer. They are briefly 
detailed in Table |6l Each sparse PCA algorithm computes ten components from these data sets. 



Study 


Samples (p) 


Genes (n) 


Reference 


Vijver 


295 


13319 


van de Viiver et al. [20021 


Wang 


285 


14913 


Wans et al. [20051 


Naderi 


135 


8278 


Naderi et al. [20071 


JRH-2 


101 


14223 


Sotiriou et al. [20061 



Table 6: Breast cancer cohorts. 



Speed. The average computational time required by the sparse PCA algorithms on each data set 
is displayed in Table |7] The indicated times are averages on all the computations performed to obtain 
cardinality ranging from n down to 1 . 





Vijver 


Wang 


Naderi 


JRH-2 


GPower^^ 


7.72 


6.96 


2.15 


2.69 


GPower^g 


3.80 


4.07 


1.33 


1.73 


GPower^i,™ 


5.40 


4.37 


1.77 


1.14 


GPower^o,™ 


5.61 


7.21 


2.25 


1.47 


SPCA 


77.7 


82.1 


26.7 


11.2 




46.4 


49.3 


13.8 


15.7 


rSVD,„ 


46.8 


48.4 


13.7 


16.5 



Table 7: Average computational times (in seconds). 
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Trade-off curves. Figure [5] plots the proportion of adjusted variance versus the cardinahty for the 
"Vijver" data set. The other data sets have similar plots. As for the random test problems, this per- 
formance criterion does not discriminate among the different algorithms. All methods have in fact the 
same performance, provided that the SPCA and rSVD^^ approaches are used with postprocessing by 
Algorithm [6l 




Figure 5: Trade-off curves between explained variance and cardinality (case of the "Vijver" data). 
The vertical axis is the ratio AdjVar(ZsPCA)/ AdjVar(ZpcA), where the loading vectors Zspca are 
computed by sparse PCA and ZpcA are the m first principal loading vectors. 



Interpretability. A more interesting performance criterion is to estimate the biol ogical interpretabil- 



ity of the extracted components. The pathway enrichment index (PEI) proposed by iTeschendorff et al 



1 200711 measures the statistical significance of the overlap between two kinds of gene sets. The first sets 
are inferred from the computed components by retaining the most expressed genes, whereas the second 
sets result from biological knowledge. For instance, metabolic pathways provide sets of genes known to 
participate together when a certain biological function is required. An alternative is given by the regula- 
tory motifs: genes tagged with an identical motif are likely to be coexpressed. One expects sparse PCA 
methods to recover some of these biologically significant sets. Table [8] displays the PEI based on 536 
metabolic pathways related to cancer. The PEI is the fraction of these 536 sets presenting a statistically 
significant overlap with the genes inferred from the sparse principal components. The values in Table 
[8] correspond to the largest PEI obtained among all possible cardinalities. Similarl y, Table |9| is based 



on 17 3 motifs. More details on the selected pathways and motifs can be found in ITeschendorff et al 



1 200711 . This analysis clearly indicates that the sparse PCA methods perform much better than PCA in 
this context. Furthermore, the new G Power algorithms, and especially the block formulations, provide 
largest PEI values for both types of biological information. In terms of biological interpretability, they 
systematically outperform previously published algorithms. 
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Vijver 


Wang 


Naderi 


JRH-2 


PCA 


0.0728 


0.0466 


0.0149 


0.0690 


GPower^i 


0.1493 


0.1026 


0.0728 


0.1250 


GPower^j 


0.1250 


0.1250 


0.0672 


0.1026 


G Powers J, m 


0.1418 


0.1250 


0.1026 


0.1381 


GPowerfo^m 


0.1362 


0.1287 


0.1007 


0.1250 


SPCA 


0.1362 


0.1007 


0.0840 


0.1007 


rSVD^, 


0.1213 


0.1175 


0.0914 


0.0914 


rSVD,„ 


0.1175 


0.0970 


0.0634 


0.1063 



Table 8: PEI-values based on a set of 536 cancer-related pathways. 



Vijver Wang Naderi JRH-2 



PCA 


0.0347 





0.0289 


0.0405 


GPower^^ 


0.1850 


0.0867 


0.0983 


0.1792 


GPower^g 


0.1676 


0.0809 


0.0925 


0.1908 


GPower£i^„i 


0.1908 


0.1156 


0.1329 


0.1850 


GPower£o.„i 


0.1850 


0.1098 


0.1329 


0.1734 


SPCA 


0.1734 


0.0925 


0.0809 


0.1214 


rSVDf, 


0.1387 


0.0809 


0.1214 


0.1503 


rSVD,„ 


0.1445 


0.0867 


0.0867 


0.1850 



Table 9: PEI-values based on a set of 173 motif-regulatory gene sets. 



6 Conclusion 

We have proposed two single-unit and two block formulations of the sparse PCA problem and con- 
structed reformulations with several favorable properties. First, the reformulated problems are of the 
form of maximization of a convex function on a compact set, with the feasible set being either a unit 
Euclidean sphere or the Stiefel manifold. This structure allows for the design and iteration complexity 
analysis of a simple gradient scheme which applied to our sparse PCA setting results in four new al- 
gorithms for computing sparse principal components of a matrix A G R^^". Second, our algorithms 
appear to be faster if either the objective function or the feasible set are strongly convex, which holds in 
the single-unit case and can be enforced in the block case. Third, the dimension of the feasible sets does 
not depend on n but on p and on the number m of components to be extracted. This is a highly desirable 
property if p <C n. Last but not least, on random and real-life biological data, our methods systemat- 
ically outperform the existing algorithms both in speed and trade-off performance. Finally, in the case 
of the biological data, the components obtained by our block algorithms deliver the richest biological 
interpretation as compared to the components extracted by the other methods. 
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7 Appendix A 



In this appendix we characterize a class of functions with strongly convex level sets. First we need 
to collect some basic preliminary facts. All the inequahties of Proposition [TT] are well-known in the 
literature. 

Proposition 11 (i) If f is a strongly convex function with convexity parameter aj, then for all x,y 
and < a < 1, 

f{ax + (1 - a)y) < af{x) + (1 - a)f{y) - ^a(l - a)\\x - yf. (40) 

( ii) If f is a convex differentiable function and its gradient is Lipschitz continuous with constant L f, 
then for all x and h, 

fix + h)< fix) + {f'ix),h) + ^\\hf, (41) 

and 



'ix)\U<J2Lfifix)-f,), (42) 



def 

where = min^gE fix). 



We are now ready for the main result of this section. 

Theorem 12 (Strongly convex level sets) Let / : E — > R Z>e a nonnegative strongly convex function 
with convexity parameter af > 0. Also assume f has a Lipschitz continuous gradient with Lipschitz 
constant Lf > 0. Then for any u > 0, the set 

Qu.=^{x\ fix)<L0} 



is strongly convex with convexity parameter 



Proof. Consider any x,y e Quj, scalar < a < 1 and let Za = ax + (1 — a)y. Notice that by convexity, 
fiza) < ^. For any n G E, 

fiz^ + u)^fiz^) + ifiza), u) + ^\\uf 



</(z,) + ||/'(z,)||||n|| + ^||n||2 

Li 



< /(za) + ^2Lj/(z,)h|| + ^||n| 
^/JW) + \f^\\u\\j 



m 

< 
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where 

(3 = ^a{l-a)\\x-yf. (43) 

In view of (|28] ). it remains to show that the last displayed expression is bounded above by u whenever u 
is of the form 

u = —a(l - a)\\x - yPs = — ^ a(l - a)\\x - yfs, (44) 

for some s G 5. However, this follows directly from concavity of the scalar function g{t) = \/t: 
^JZi^ = g{u -I3)< gicv) - {g'{u;), /3) 




□ 

Example 13 Let f{x) = Hxp. Note that aj = Lf = 2. If we let u> = r^, then 

Qui = {x I f{x) < io} = {x \ \\x\\ < r} = r • B. 

We have shown before (see the discussion immediately following Assumption\3^, that the strong convexity 
parameter of this set is cjg^ = p Note that we recover this as a special case ofTheorem\12\ 

aj 1 
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